Suffix Tree Based Chinese Document Feature Extraction and Clustering in RSS Aggregator
Abstract
In an RSS aggregator, a key issue is how to make feed information more manageable for subscribers. In this paper, we propose a suffix-tree-based method for clustering RSS feed documents in a Chinese RSS aggregator. We construct a suffix tree from meaningful Chinese words and select, as document features, the phrases that receive a high score from a scoring formula. We cluster documents using the group-average algorithm with a new document similarity measure. The experimental results show that the new method improves clustering quality in the document “snippets” scenario, and that its speed meets the demands of “on the fly” clustering.
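The pipeline described in the abstract (shared phrases as features, a phrase score, group-average clustering) can be illustrated with a minimal Python sketch. Several parts are assumptions, since the abstract does not give them: the scoring formula (here, document count times phrase length), the similarity measure (here, Jaccard overlap of phrase sets), and the function names `shared_phrases`, `doc_features`, and `group_average` are all hypothetical. For brevity the sketch enumerates word n-grams directly instead of building a real suffix tree, and it assumes the documents are already segmented into Chinese words.

```python
from collections import defaultdict
from itertools import combinations


def shared_phrases(docs, max_len=3):
    """Map each word n-gram (up to max_len words) to the documents containing it.

    Stand-in for a word-level suffix tree: only phrases shared by at least
    two documents are kept, like internal nodes reached by several documents.
    """
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                index[tuple(words[start:end])].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= 2}


def doc_features(docs, top_k=20, max_len=3):
    """Keep the top_k phrases by an ASSUMED score: document count * phrase length."""
    phrases = shared_phrases(docs, max_len)
    ranked = sorted(phrases, key=lambda p: (-len(phrases[p]) * len(p), p))
    feats = [set() for _ in docs]
    for phrase in ranked[:top_k]:
        for doc_id in phrases[phrase]:
            feats[doc_id].add(phrase)
    return feats


def jaccard(a, b):
    """ASSUMED similarity measure: overlap of two phrase sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_average(feats, threshold=0.2):
    """Average-link agglomerative clustering; stop when the best merge is too weak."""
    clusters = [[i] for i in range(len(feats))]

    def avg_sim(c1, c2):
        total = sum(jaccard(feats[i], feats[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), avg_sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


docs = [
    ["苹果", "发布", "新", "手机"],
    ["苹果", "发布", "新", "电脑"],
    ["股市", "今日", "大涨"],
    ["股市", "今日", "下跌"],
]
print(group_average(doc_features(docs)))  # [[0, 1], [2, 3]]
```

The stopping threshold takes the place of a fixed cluster count, which seems a natural fit for short feed snippets where the number of topics is not known in advance.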
Similar resources
A new keyphrases extraction method based on suffix tree data structure for arabic documents clustering
Document clustering is a branch of a larger area of scientific study known as data mining. It is an unsupervised classification technique used to find structure in a collection of unlabeled data. When Full Text Representation is used, the useful information in the documents can be accompanied by a large amount of noise words, which negatively affects the result of the clustering process. S...
A Novel Weighted Phrase-Based Similarity for Web Documents Clustering
Phrases have been considered a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerat...
Suffix Tree Clustering on Post-retrieval Documents
Clustering is used to divide a collection of data into groups based on the similarity of objects. Document clustering has been studied in the context of information retrieval (IR). An IR system always returns a list of retrieved documents to the user. These post-retrieval documents can be clustered to help users browse and navigate the search results. For this purpose, Zamir and Etzio...
A semantics-based method for clustering of Chinese web search results
Information explosion is a critical challenge to the development of modern information systems. In particular, when an information system operates over the Internet, the amount of information on the web has been increasing exponentially. Search engines, such as Google and Baidu, are essential tools for people to find information on the Internet. Valuable informa...
A New Cluster Merging Algorithm of Suffix tree Clustering
Document clustering methods can be used to structure large sets of text or hypertext documents. Suffix Tree Clustering has been shown to be a good approach for document clustering. However, the cluster merging algorithm of Suffix Tree Clustering is based on the overlap of the clusters' document sets, which completely ignores the similarity between the non-overlapping parts of different clusters. In this pape...